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ABSTRACT 

A powerful method to measure the mass profile of a galaxy is through the velocities of tracer particles 
distributed through its halo. Transforming this kind of data accurately to a mass profile M{r), 
however, is not a trivial problem. In particular, limited or incomplete data may substantially affect 
the analysis. In this paper we develop a Bayesian method to deal with incomplete data effectively; we 
have a hybrid-Gibbs sampler that treats the unknown velocity components of tracers as parameters in 
the model. We explore the effectiveness of our model using simulated data, and then apply our method 
to the Milky Way using velocity and position data from globular clusters and dwarf galaxies. We find 
that in general, missing velocity components have little effect on the total mass estimate. However, 
the results are quite sensitive to the outer globular cluster Pal 3. Using a basic Hernquist model with 
an isotropic velocity dispersion, we obtain credible regions for the cumulative mass profile M(r) of the 
Milky Way, and provide estimates for the model parameters with 95% Bayesian credible intervals. The 
mass contained within 260kpc is 1.37x IO^^Mq, with a 95% credible interval of (1.27,1.51) x IO^^Mq. 
The Hernquist parameters for the total mass and scale radius are l.hh^g/® x IO^^Mq and 16.9tm 
where the uncertainties span the 95% credible intervals. The code we developed for this work, Galactic 
Mass Estimator (GME), will be available as an open source package in the R Project for Statistical 
Computing. 

Subject headings: Milky Way — galaxies: dark matter — galaxies: halos — galaxies: satellites 


1. INTRODUCTION 

Almost every galaxy in the universe is assumed to re¬ 
side in a massive, dark matter halo that can extend far 
beyond the visible components of the galaxy. Standard 
methods to determine the mass distribution within vis¬ 
ible portions of galaxies are based on rotation curves or 
velocity dispersion profiles. The former is applicable to 
spiral galaxies, while the latter method is used mainly for 
elliptical galaxies or disk galaxy bulges. At large radii, 
where the mass distribution is presumably dominated by 
dark matter, one can use observations of kinematic trac¬ 
ers to learn about a galaxy’s mass profile. 

The Milky Way (MW) has many distant satellites, such 
as globular clusters, halo stars, planetary nebulae, and 
dwarf galaxies. The kinematic properties of these satel¬ 
lites can be used to learn about the gravitational poten¬ 
tial of the whole system, and thus the Galaxy’s mass 
profile out to large radii. 

One way to use the kinematic data of tracers to es¬ 
timate the mass distribution is via a mass estimator, a 


method first suggested by Hartwick & Sargent (19781 
and a method that avoids using an explicit model. The 
method has endured partly because it uses only the line 
of sight velocities and positions of the tracers. It is use¬ 
ful when no proper motions are available and conversion 
to a Galactocentric refrence frame is impossible. Nowa¬ 
days, however, many MW tracers have proper motion 
measurements, and more continue to become available. 
Thus, it would be beneficial to have a method which 
can incorporate both complete data (i.e. tracers with 3- 
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dimensional space motions in the Galactocentric frame) 
and incomplete data (i.e. tracers with only line of sight 
velocities). In this paper we introduce such a method of 
mass estimation. 

The method used here is based on the phase-space dis¬ 
tribution function (DE), which is a probability distribu¬ 
tion for a satellite in terms of its position r and veloc¬ 
ity V in the Galactocentric reference frame. Using the 
DF for a model and a Bayesian approach to the anal¬ 
ysis, we obtain probability distributions for the model 
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Galactocentric reference frame, but previous studies have 
re-written DFs in terms of the line-of-sight velocity com- 
ponent only, in order to inc orporate incomplete data (e.g. 


Wilkinson & Evans 1999). This does not take full ad- 
vantage of the complete data that is available, which is 
an issue when the method may be susceptible to large 
uncertainties due to small sample size (as discussed by 
Zaritsky et al.|[T989|. Furthermore, rewriting the DF in 


terms of the fme-ot-sight velocity can be mathematically 
difficult. In our study, we introduce a generalized ap¬ 
proach via the Bayesian framework, whereby it is easy 
to incorporate complete and incomplete data simultane¬ 
ously, and also unnecessary to rewrite the DF in terms 
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of the line-of-sight velocity. 

The purpose of this paper is to lay out the method 
fully, and set the groundwork for future studies with a 
range of DFs and datasets. We also test the method with 
simulated data, and do some preliminary analysis with 
the method as it applies to the Milky Way. _ 

The outline of this paper is as follows. Section |2.1| 
briefly desc ribes the theoretical background of DFs, and 
Section 2.2 introduces two models that already have an- 
alyt ic D Fs, which we use in this work. Next, in Sec- 
tion |2.3[ we review how Bayes’ Theorem can be used with 
the DFs to obt ain para meter estimates of the model. Sec¬ 
tions [2^ |2.5[ and |2.6| introduce the method for incorpo- 
rating both complete and in com plete data via a hybrid- 
Gibbs Sampler, and Section [2?7| discusses the techniques 
used to assess convergence of the Markov chains. 

We first apply and test our method on simulated data 
(described in Section]^ and then apply the method to 
some kinematic data of satellites orbiting the Milky Way 
(described in Section]^ . The results of these analyses are 
presented and discussed in Sections and [^respectively. 
Many future prospects are also discussed in Section 

2. BACKGROUND 

This section provides background information and no¬ 
tation about distribution functions, the models we use 
in our analysis, and the methods of Bayesian inference 
as they apply to the current pr oblem. Addit ional details 
and discussion can be found in Eadie (2013). 


2.1. Distribution Function 

The Distribution Function (DF), f{r, v), is a probabil¬ 
ity density function that gives the probability of finding a 
particle with a position r and velocity v within a phase - 
space volume element d^rcfiv (Binney & Tremaine 2008). 
Like any probability density, the DL' integrates to one: 


J /(r, v)d^rd^v = 1. 


( 1 ) 


Eq. [2 is often renormalized so that the DF integrates to 
a quantity of interest, such as the total mass Mtot- 


j /(r,v)dWv = Mtot. 


( 2 ) 


However, in a Bayesian framework the DF as defined in 
eq. [l] is used, and thus the left-hand-side of eq. is di¬ 
vide by Mtot- Thus, it is important that models nave a 
finite mass in a Bayesian analysis— if the mass is infinite, 
then the DF is not a proper probability distribution. 

A DF can be specified by use of Jeans’ Theorem, which 
states that any solution of the time-independent colli¬ 
sionless Boltzmann equation is a function of the phase- 
space coordinates (r, v) only. In a time-independent sys¬ 
tem, the Hamiltonian H = 'y + <i)(r) is always an inte¬ 
gral of motion, and if the system is also spherical then the 
magnitude of the angular momentum, L, is an integral of 
motion too. Therefore, any non-negative function /(iJ) 
or /(iJ, L) will be a solution to the time-independent 
collisionless Boltzmann equation, and thus a DF for the 
system. Whether or not / is a function of H or both H 
and L determines the velocity dispersion of the system; 
f{H) corresponds to an isotropic system, and f{H, L) 
corresponds to an anisotropic system. 


In practice, a DF corresponding to an isotropic, spher¬ 
ical, self-consistent system is usually written in terms of 
5, the relative energy per unit mass, defined as 


£ = + «'(r) 


( 3 ) 


where v is the speed of a particle at a distance r from 
the center of the system, and 'k(r) is the relative gravita¬ 
tional potential of the system at r, as defined in Binney] 
& Tremaine (2008). Particles with £ < 0 are unbound 
and require / = 0. If the system has an anisotropic ve¬ 
locity dispersion, then the DF is written as a function of 
both £ and the angular momentum L. 

2.2. Models 

Models with analytic DFs are preferable to empiri¬ 
cal distribution functions in theoretical analyses because 
they allow for easy sampling of the distribution, and also 
save computation time by avoiding numerical integra¬ 
tion. Finding a DF that models a realistic galaxy is a dif¬ 
ficult task, however, because galaxies are often composed 
of multiple subsystems such as a bulge, a stellar halo, a 
dark matter halo, and possibly a disk. Finding a single 
phase-space distribution function that is self-consistent, 
analytic, and that describes the intricate features of a 
galaxy is very challenging. 

An empirical luminosity profile that has been success¬ 
ful in fitting the surface b rightness profiles of ell iptical 


galaxies and bulges is the jde Vaucouleursj (1948) 
profile. A generalizatio n of the profile is j^l/n 


which was introduced by Sersic (1968). Due to the suc¬ 
cess of theorists have tried to develop distribution 
functions that can re produce the prof ile. The analytic 
models introduced by Jaffe (1983) and Hernquist ( 1990| 
fit the type galaxies well for most radii. 

In this work, primarily for the purpose of testing the 
method, we use the Hernquist and Jaffe models because 
of their analytic simplicity and their ubiquity. The Hern¬ 
quist model also has the benefit of having more than one 
analytic DF— one has an isotropic velocity dispersion, 
and there are a few that are anisotropic. Furthermore, 
both the Hernquist and Jaffe models are self-consistent 
and have a finite total mass, making consistent numeri¬ 
cal computations feasible. For these reasons, we use the 
Hernquist-style models to lay out the methodology of 
our Bayesian approach and the derivation of mass pro¬ 
file credible regions. In future work, we will extend the 
method to include models with non-analytic DFs and 
non-hnite mass distributions, such as thejNavarro et al.j 
([19^ (NFW) model. 


Hernquist (1990) introduced a halo model that is a self- 
consistent, analytic potential-density pair. With G = 1, 
the gravitational potential of the Hernquist model is 


X Mtot 

= —— 

r -I- a 

and the mass density profile is 

aMtot 


p{r) = 


27 rr (r -|- a)" 


( 4 ) 


( 5 ) 


where Mtot is the total mass of the system, and a is the 
scale radius. Integrating over a sphere, the Hernquist 
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cumulative mass profile is then, 
M(r) = Mtot 


ir + a)^ 


( 6 ) 


Hernquist (1990) provides two DFs for their model 
that are written m terms of elementary functions: one 
for an isotropic velocity dispersion, and a second fo r an 
anisotropic vel ocity dispersion of the Osipkov (1979) and 
Merritt (1985) type (hereafter OM-type). I'he (JM-type 
allows the anisotropy to vary as a function of r, and in¬ 
cludes a constant parameter called the anisotropy radius 


P{r) = 


(7) 


The parameter Tq controls the degree of radial anisotropy 
in the system at different radii. As —>• oo, /3(r) —0 
(completely isotropic). 

The third Hernquist DF used in this research has a con¬ 
stant anis otropy (3 = 0.5, which was derived by Evans fc' 
An (2006 ). We consider this model because recent re- 
search by Deason et al. (2013) showed that blue horizon¬ 
tal branch (HHBj stars have a radially biased velocity 
anisotropy of 0.5 between 16 and 28kpc, suggesting that 
/3 may be approximately constant for most of the stellar 
halo. 



(8) 


Mtot 

(9) 

47ra,r2 (1 -|- r/a,)^ 


We also consider the isotropic Jaffe (1983) model in 
this work, which has mass profile and potential 


$j(r) = - 


where oj is the scale radius. 

Overall, we fit the Milky Way data to four dif¬ 
ferent models: three Hernquist models with different 
anisotropies (isotropic, OM-type, constant anisotropy 
{j3 = 0.5), and the isotropic Jaffe model. 

2.3. Bayes Theorem and Parameter Estimation 

Bayes’ Theorem is named after Thomas Bayes (1701- 
1761) and was introduced posthumously by Richard 
Price dBayes & Price||1763|). Using the rules of condi¬ 
tional probabilities, Bayes showed that the conditional 
probability p( AIH) is. 


where p{y\6) is called the likelihood, p{9) is the prior 
probability on the parameters, and p (y) is the marginal 
probability of the data. Because the marginal probability 
does not depend on 6, and with fixed y can be considered 
a constant, it is common practice to sample the unnor¬ 
malized posterior probability, 

p{e\y) exp{y\e)p{e). ( 12 ) 

The pioneering work by ||Bahcall & Tremaine ( 1981|) 
showed that the DF determines the likelihood p{y\&). 
For example, the probability of the first satellite in our 
data set having ri and Vi, given the model parameters, 
is f{ri,vi\0). Assuming all n satellites in the data set 
are independent, the probability of their corresponding 
positions and velocities is the product of the DFs, and 
thus the likelihood is 


pivW) = Y[f{n,v,\9). 


2=1 


Therefore, eg becomes 


oc ]g/(ri,t)i|e)p(0) , 


(13) 


(14) 


Sampling eq. is usually done via a Markov Chain 
Monte Carlo (MCMC) method, which creates a Markov 
chain— a sequence of random variables, 0*, where t = 
1,2,3... represents the position in the chain. Every ran¬ 
dom variable i n the chain depends only on the variable 
before it, 9*~^ ( Gelman et al.[2003 1. When MCMC algo¬ 
rithms are used to sample a Bayesian posterior density, 
then by construction, the Markov chain is a collection of 
parameter vectors that hav e th e same stationary distri¬ 
bution as the posterior (eq. [TT 


We apply the Metropolis algo rithm (Metropolis & 
Ulam|1949 Metropolis et al.|1953 ) to sample eq. TT 'i'he 
Metropolis algorithm is iterative and creates a Markov 
chain whose stationary distribution is proportional to the 
Bayesian posterior probability in question. The Markov 
chains in t his work are constructed as follows (Gelman 

erA^[2003t : 


1. Draw a trial value 9* from a symmetric proposal 
distribution 

2. Calculate d = 


p{A\B) 


p{B\A)p{A) 

p{B) 


( 10 ) 


now known as Bayes’ Theorem. _ 

Bayesian inference involves using eq. in data anal¬ 
ysis to obtain probability distributions about model pa¬ 
rameters. Bayesian inference returns a probability dis¬ 
tribution for parameters given the data and a prior dis¬ 
tribution on the parameters. 

When Bayes’ Theorem is used for Bayesian inference, 
it is re-written in terms of the vector of model param¬ 
eters 9 and the data y, and is sometimes referred to as 
Bayes’ rule. The Bayesian posterior probability for 9, 
given some data y, is then 


p{S\y) 


p{y\9)p{9) 

p{y) 


3. If d > 1, then accept 9* as 0* 

(a) set 0* = 0* 

(b) return to step 1 

4. If instead d < 1, then accept 0* with probability d 

(a) draw a random number z from the uniform 
distribution U (0,1) 

(b) if d > z then accept 0* as in step 3, and return 
to step 1 

(c) if d < z then reject 0* 

i. set 0* = 0*“^ 

ii. return to step 1 


( 11 ) 
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Accepting 6* only when d > z ensures that the 6 values 
are accepted with probability proportional to the poste¬ 
rior, provided that the chain has converged to the tar¬ 
get distribution. The above process is repeated N times, 
resulting in a Markov chain with N values of 9 which 
represents samples from the posterior. 

Because a Bayesian analysis leads to distributions for 
parameters, the results are arguably easier to interpret. 
The Markov chains can be used to acquire estimates, 
uncertainties, and probabilities pertaining to model pa¬ 
rameters. Furthermore, the uncertainties can be carried 
forward to subsequent modeling and analysis. 

A Bayesian analysis can also easily include nuisance 
parameters— parameters in the model whose values are 
unknown but not necessarily of interest to the researcher. 
This feature turns out to be useful in our current problem 
of galaxy mass estimation; sometimes we do not know 
the tangential velocity component of a satellite object, 
but we can treat that unknown component as a nuisance 
parameter in the model. 


2.4. Nuisance (vt) Parameters 

In a Galactocentric coordinate system, the total speed 
of a satellite in orbit around the Galaxy can be written 

V = \Jvl+ v'i (15) 

where Vj. and Vt are the radial and tangential components 
respectively. In turn, 

Vt=vl + vl. (16) 

The total speed of a satellite is needed for the DF 
f{r,v\9) in the likelihood of Bayes’ theorem. However, 
proper motion measurements are not available for all 
satellites. In many cases, distant tracers of the MW have 
only line-of-sight velocity measurements with respect to 
our position in the Galaxy. We want to use this satel¬ 
lite information in our analyses, but without a proper 
motion, the line-of-sight velocity in the local standard of 
rest frame does not give us Vr or vt in Galactocentric co¬ 
ordinates. For very distant objects, the line-of-sight ve¬ 
locity is approximately v\os ~ Vr, since the angle created 
by the location of the Sun, the satellite, and the center 
of the galaxy is quite small. However, we still have no 
value for vt- If we treat these unknown ut’s as nuisance 
parameters in the model, then Bayes’ rule reads, 

n 

p{d\y) oc Yl f {ri,Vr^,)\e ,Vt^t)p{9)p{vt,i) (17) 

i=l 

where p(vt^i) is the prior probability on the tangential 
velocity of the ith satellite, to be discussed in section [2?6l 


2.5. The Gibbs Sampler 

When nuisance (vt) parameters are present , we use a 
Gibbs sa.mpler , which was first introduced by|Geman &: 


Geman (19841 in the area of image processing) and then 


ad apted to iterative simul a tions in the study of stat istics 
by |Tanner & Wong| (|1987D . |Gelfand & Smith|(|1990D then 
showed how to apply it to Bayesian inference. Since then, 
the Gibbs sampler h as been applied to many problems 


_mo_ 

( Gelman et aLjpOOS and references therein). 


The Gibbs sampler is sometimes called alternating 
conditional sampling, and can be very useful i n multi¬ 
dimensiona l problems where 9 = (0i,...,0„) (jGelman 
et al.|2003 1. The Gibbs algorithm samples each of the pa¬ 
rameters ( 01 , ■■■,9n) one at a time, based on the current 
value of all of the other parameters and the conditional 
probability given those parameters. 

Consider 0i, the zth parameter in a model with n pa¬ 
rameters, and let 9-i represent all of the other parame¬ 
ters. Next, let t be the iteration of the chain— the 
chain that will be a sample of the posterior distribution 
p{9\y). In the Gibbs sampler, each 9i is sampled one 
at a time based on its conditional probability given the 
current values of all of the other parameters, 


pmd--\y) 


(18) 


where 9*_-^ represents the other parameters at their cur¬ 
rent value. 


nt—1 _ /nt nt /nl 

u_i — [Ui, 


+ 1 5 ’"I'-'n 




In our work, we do not directly sample the conditional 
distributions (eq. 181 because they are usually not avail¬ 
able. Instead, we use a Metropolis step to update the 
conditional distributions. Thus, we employ a hybrid- 
Gibbs sampler: we use a symmetric proposal distribu¬ 
tion, so that the accept/reject condition of the trial pa¬ 
rameter 9 * fol lows the same algorithm as that described 
in sectionl^^ except d is now 


d = 


p{n\s^-^,y) 

p{or^\ol-^,y) 


Gelman et al. 

20031. 

In the prob 

em at 


is more efficient than a standard Metropolis algorithm. 
The latter method samples all parameters simultane¬ 
ously, while the former samples parameters individually. 
Under the Metropolis algorithm, if even one parameter 
suggestion is highly improbable, then the entire vector 
of parameters is likely to be rejected. Therefore, a high¬ 
dimensional Markov chain may take an extremely long 
time to walk through parameter space and converge to 
the posterior distribution. By contrast, the hybrid-Gibbs 
sampler is much more efficient in our high-dimensional 
Markov chain (there are 2 model parameters, and 44 tan¬ 
gential velocity parameters for the Milky Way data dis¬ 
cussed below). The parameters Mtot and a are sampled 
simultaneously, based on the current Vt parameters, and 
the Vt parameters are sampled individually based on the 
current values of all the other parameters. 

Using the hybrid-Gibbs sampler for the v^s allows us 
to obtain a probability distribution for each vt param¬ 
eter efficiently. We can look at the probability distri¬ 
bution for each vt and make a prediction of the most 
probable Vt value for each satellite. Although we find 
that the resulting vt distributions are quite diffuse, and 
a meaningful prediction of Vt cannot be made from them, 
the hybrid-Gibbs sampler method is nevertheless efficient 
and in general does not affect the mass estimate of the 
Galaxy (as will be shown below). If we assume that the 
satellite is bound to the galaxy, then by setting eq. to 
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zero we obtain an upper limit on the tangential velocity, 

Vt,7nax = V25'(r) - (19) 

2.6. Prior Probabilities 

In a Bayesian analysis, the choice of a prior can be 
thought of as a chance for the researcher to state plainly 
and explicitly the prior assumptions. When little is 
known about the problem at hand, it is common to use a 
noninformative prior, so that the information contained 
in the likelihood is not overwhelmed by information con¬ 
tained in the prior. 

In this preliminary analysis, we use uniform priors for 
all of the model parameters because we assume little 
about the mass and scale of the system. The uniform 
prior for each parameter 6 is. 


p{0) = 


1 


0m,n.x 0n 


( 20 ) 


where 0mm and 9max are the lower and upper bounds of 
the uniform distribution. In practice, the parameters are 
sampled in the natural log-space to ensure that the total 
mass and scale radius are always positive. The Markov 
chain values are then exponentiated before examination 
of the posterior. The bounds {Omin,0max) that we use 
for Mtot and a are on the order of (10®, 10 ^®)Mq and 
(10“^, 10®)kpc respectively. 

When the tangential velocities are treated as nuisance 
parameters and sampled in the Markov chain, they too 
require a prior. The tangential velocity is a 2-dimensional 
vector on the plane of the sky, so we would like the prior 
on vt to be uniform in v^. Because we are sampling vt 
and not the squared tangential velocity, the uniform prior 
on vf needs to be transformed to one for Vt- 


Her e we use Jeffreys’ invariance principle (Jeffrey 
19391. Suppose a parameter 9 has a prior distribu- 
tion p{9), and that a one-to-one transformation is sub¬ 
sequently performed on 9 such that (j) = h{9). Then the 
Jeffrey’s prior for (j) which expresses the same belief as 
that of p{9) is 


p{(j)) =p{9) 


d9 

d(j) 


= pi9)\h'i9)\- 


( 21 ) 


( Gelman et al.|[2003 |. 

Following equation 
h{9) = \/v\ 


let 9 = vf and (j) = Vt, so that 


21 

^ Then the prior on vt is given by 


P{vt) = 


2vt 


_ ^l2 

^t.max 


( 22 ) 


t^min 


The minimum tangential velocity, Vt^mim is zero, and the 
maximum tangential velocity, Vt.max, is a large constant. 
Note that if a value of Vt that makes a particle unbounded 
is suggested in the Markov chain, it will be rejected via 
the likelihood. 

2.7. MCMC chains and Assessing convergence 

The computer code created for this research is writ¬ 
ten for the R Project for Statistical Computing (R), an 
open source softw are environment for statistical com put- 
ing and graphics (R Development Core Team|2012 1 with 
many well-developed and efficient statistical diagnostic 


tools. Recently, R has gained popul arity in astronomy 
and t he field of astrostatistics (e.g. |Feigelson & Babu| 
20121, and our code is being developed into an R pack- 
age, called Galactic Mass Estimator (hereafter GME). 

GME takes data of the form (r, w^Wt) in Galactocen- 
tric coordinates, allows the user to select one of five DFs, 
constructs three Markov chains in parallel, and then com¬ 
bines them into a final, single chain after convergence 
conditions have been met. The result is a single Markov 
chain that represents samples from the posterior distri¬ 
bution for the model parameters, giv en the data. The 
SNOW package (Tierney et al. (2013l) i s used for paral¬ 
lel co mputing, and the CUBA package (Plummer et al.j 


20061 is used for convergence diagnostics. 

Many diagnostics to assess convergence of a Markov 
chain have been developed, and most of these methods 
use multiple chains. One advantage of running multiple 
chains is that the initial values can be dispersed widely 
in the parameter space, and then convergence can be 
approximated when they appear to reach a common sta¬ 
tionary distribution. This approach allows more explo¬ 
ration of the parameter space and makes it less likely for 
a local maximum to be mistaken for the mode of the pos¬ 
terior. Furthermore, using multiple chains on the same 
data set allows estimates of convergence to be obtained 
in a m ore reliable and quantita tive manner than a single 
chain. Gelman & Rubin (19921 suggest using the statis¬ 


tic R to assess the mutual convergence of parallel chains: 


R = 


'var+ (■0|y) 
W 


(23) 


In equation 23 var^ {4’\y) is the marginal posterior vari¬ 
ance of the estimand [parameter]^ which is essentially a 
weighted average of the withi n-chain variance W, and 
between-chain variance B (see Gelman et al. 2003 for 
more details). In practice, R is calculate d for each pa¬ 
ramet er separately; to assume convergence, Gelman et al. 
(20031 recommend a value of i? < I.l for every param¬ 


eter, and this is the criterion we use. The R statistic is 
available in the R Software S tatistical Gomputing Lan¬ 
guage in the CODA package ( [Plummer et al.|[2006 ). 

In this work, the three Markov chains’ starting val 


starting values 
are widely dispersed in the parameter space, and each 
chain js run for i = 10® iterations. After this initial 
run, R is calculated for each parameter. If any of the 
parameters have R > 1.1, then it is assumed that the 
chains have not converged, and the last parameter values 
in each chain are used as the initial parameters in three 
new chains. The process is repeated until all parameters 
across all three chains have R < 1.1, at which point 
convergence is assumed. The final sample of the posterior 
distribution is created by combining the last halves of 
the three chains, thus providing 1500 parameter vector 
samples. Prior to this final step, however, we check the 
effective sample sizes of the Markov chains. 

In general, the draws in a Markov chain are not truly 
independent ; some autocorrelatio n exists in the sequence 
of samples (Gelman et al. 2003). The effective sample 
size is the equivalent number of independent samples: 


Ueff = mn- 


var "^ (^|y) 
B 


(24) 
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where m is the number of parallel Markov chains and 


n is the number of draws in each chain (Gelman et al. 


20031. If the draws in all chains were perfectly indepen¬ 


dent, then the number of independent draws would be 
mn. However, the draws within a chain are general au- 
tocorrelated, and so B > vSi'^ {ip\y), and Ue// < mn. 
An effective sample size of at least 100 is necessary to 
obtain reliable first-moment statistics such as the mean 
and median, while an effective sample size over 200 is 
needed for second-order moments. 

In our code, we use the effectiveSi ze function in the 
CODA package dPlummer et al.||2006|), and find that all 
parameters had effective sample sizes greater than 300 for 
all models when i = 10'*. An acceptance rate between 
20 and 30% is required for the Vt parameters, and an 
acceptance rate of 30-40% is required for the model pa¬ 
rameters. The final chain (15000 samples) for each model 
is visually inspected to ensure that the three chains did 
not reach very different maxima in the posterior distri¬ 
bution. 


3. SIMULATIONS 

Prior to data analysis, we explored the statistical prop¬ 
erties of our Bayesian estimates under repeated sampling, 
using simulation. Unfortunately, a Bayesian analysis 
cannot be repeated when the data is real (i.e. for a single 
data set). A Bayesian analysis can be repeated, however, 
with simulated data sets produced from the same DF, 
and analysed independently using the same model. By 
examining the range of parameter estimates, the aver¬ 
age behaviour of the model on this type of data can be 
explored. Simulations and analyses of trivial cases (i.e. 
when the model and data have the same distribution) are 
also an effective way to test code. 

It is expected that when the simulated data come from 
the same DF as that of the model, then on average the 
Bayesian parameter estimates should be correct. Fur¬ 
thermore, quantities like uncertainties and credible inter¬ 
vals should be reliable (e.g. a 50% credible region should 
contain the true parameter values 50% of the time). In 
contrast, when the simulated data comes from a different 
DF than that of the model, biases in the estimates may 
occur, and the credible regions may become unreliable 
(e.g. overconfident). Moreover, we need to investigate 
whether or not treating missing vt measurements as pa¬ 
rameters affects other parameter estimates, regardless of 
whether or not the data and model share the same DF. 

We simulate mock observations of 100 satellites orbit¬ 
ing a galaxy whose gravitational potential follows the 
Hernquist model. The mock tracer observations include 
their distances r and velocity components Vr and Vt ■ We 
explore the effects of assuming an isotropic Hernquist 
model in the following three scenarios: 

1 . isotropic data with complete velocity vectors 

2. isotropic data with 50 unknown Vt values 


3. constant anisotropic data from the /3 = 0.5 Hen- 
quist model, with 50 unknown vtS 

For each scenario, we create 500 data sets with 100 par¬ 
ticles each. The Bayesian analysis is perf o rmed on each 
set of data, as described in Sections 2)^ - 12.7[ yielding 
500 Markov chains for each scenario. From these chains. 


statistics such as the mean parameter estimate and cred¬ 
ible intervals are calculated. 

For all the simulations, we use Mtot = 10 *^Mq and 
a = 15kpc to generate the data. Our choice for the total 
mass is based on many other studies that have shown the 
Milky Way’s mass to be close to this order of magnitude. 
For numerical simplicity, the following units are used in 
our code: the gravitational constant G = 1, r is measured 
in kiloparsecs (kpc), velocity components are measured 
in 100km s“*, and mass is measured in 2.325 x 10® Mq. 

4. KINEMATIC DATA FOR THE MILKY WAY 


In principle, any well-defined object orbiting the 
Galaxy with a measured distance from the Galactic cen¬ 
ter and at least one velocity measurement may be used 
to estimate the mass of the Milky Way. In this work, as a 
first run, we use only globular clusters (GCs) and dwarf 
galaxies (DGs). It is possible to measure the proper mo¬ 
tions and line-of-sight velocities of these tracers in the 
Galaxy’s halo, and to convert these measurements into 
Galactocentric coordinates. Indeed, many of the kine- 


made (e.g. 

Dinescu et al. 19991 Casetti-Dinescu et al. 

2010 2013 

Boylan-Kolchm et al.(20l3l). However, the 


measured, and so the conversion from our frame of refer¬ 
ence to a Galactocentric one cannot be performed. Nev¬ 
ertheless, the line-of-sight velocities of these objects are 
available, and could contain useful information about the 
Galaxy’s mass profile. Thus, we incorporate some of this 
incomplete data into our analysis. 

The data used in this research are in Galactocentric co¬ 
ordinates (see Table [I|). The first 59 objects are GCs, and 
the last 29 are DGs. Note that 26 GCs and 18 DGs listed 
do not have tangential velocities, because they have no 
proper motion measurements. The Galactocentric radial 
velocities for these data must be approximated, for which 
we assume Vr ~ uios. We use this approximation only for 
objects with jcosy] > 0.95 (where 7 is the angle sub¬ 
tended by the line connecting the Sun and the Galactic 
Centre, from the object), guaranteeing that any further 
adjustment to ?;ios will be small. We also exclude the 
following clusters, even though they have |cos 7 | > 0.95, 
because they are either associated with the Sagittarius 
dwarf galaxy or their measurements suffer from high ex¬ 
tinction: Djorg 1, NGC 6401, NGC 6715, NGC 6544, 
NGC 6715, Pal 6 , Terzan 1, Terzan 6 , Terzan 7, and 
Terzan 8 . 

The GC data are taken from Dinescu et al. (|1999 ) 


Casett i-Dine scu et a_L _( ^0101, Gasetti-Dinescu et al. 
12013D, and |Harris| (|l996t, while data for six of the 
dwarf galaxi e s are t aken from Soh n et al. (|2013) (Leo I), 
Pryor et al. (2014|) (Draco), and |Boylan-Kolchin et al. 
(I 2 IJI 3 I) (Fornax LMC, SMC, and Scuh 


1 20131 (Fornax, LMC, SMC, and Sculptor), i’he rest 
of the dwarf galaxy data, which do not have tangen- 
tial velocitie s, are from the compilation given in | Watkins] 


et al. (2010) and references therein. Uncertainties in the 


Watkins et al. (2010) dwa rfs’ Vr values are t aken from 
the HyperLeda Catalogue ( Paturel et al.||2003 ), with the 
exception of those for Corna Be renices, Sagittarius, an d 
Sextans, which a re ta ken from [Simon fc Geha (2007), 
Ibata et al.|(|1997j), andjwalker et al.|(|2006|) respectively. 
'I'he r-values in 'I'able are based on mean magnitudes 
of RR Lyrae and horizontal branch stars, and are uncer- 
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TABLE 1 

Milky Way Kinematic Data 


Object 

(kpc) 

Vr 

(km s“^) 

AVr 

Vt 

(km s“^) 

Avt 

COS 7 

Object 

(kpG 

Vr 

(km s“^) 

<1 

Vt 

(km s“^) 

Avt 

COS 7 

NGC 104 

7 

17.0 

0.2 

171.0 

22.0 

0.15 

NGG 6540 

3 

0.0 

1.4 

- 

- 

-0.97 

NGC 288 

11 

16.0 

0.4 

59.0 

18.0 

0.75 

NGG 6569 

3 

-0.2 

5.6 

- 

- 

0.95 

NGG 362 

9 

55.0 

0.5 

85.0 

31.2 

0.61 

NGG 6864 

15 

-1.1 

3.6 

- 

- 

0.96 

NGG 1851 

16 

186.0 

0.6 

170.0 

42.4 

0.89 

IG 1257 

18 

-66.5 

2.1 

- 

- 

0.99 

NGG 1904 

18 

93.0 

0.5 

83.0 

44.7 

0.94 

Arp 2 

21 

153.0 

10.0 

- 

- 

0.99 

NGG 2298 

14 

-58.0 

1.3 

100.0 

52.7 

0.89 

NGG 7492 

25 

-97.4 

0.6 

- 

- 

0.95 

Pal 3 

85 

-247.0 

8.4 

242.0 

121.5 

1.00 

NGG 5824 

26 

-117.7 

1.5 

- 

- 

0.98 

NGG 4147 

19 

57.0 

1.0 

161.0 

65.7 

0.93 

Pal 13 

27 

192.4 

0.3 

- 

- 

0.95 

NGG 4590 

9 

-99.0 

0.6 

300.0 

35.6 

0.69 

NGG 5694 

29 

-228.1 

0.8 

- 

- 

0.98 

NGG 5024 

18 

-106.0 

4.1 

250.0 

86.5 

0.90 

NGG 6229 

30 

22.6 

7.6 

- 

- 

0.96 

NGG 5139 

6 

-31.0 

0.7 

65.0 

14.1 

0.05 

Whiting 1 

35 

-103.5 

1.8 

- 

- 

0.98 

NGG 5272 

12 

2.0 

0.4 

164.0 

24.5 

0.75 

Pal 2 

35 

-104.4 

57.0 

- 

- 

1.00 

NGG 5466 

16 

254.0 

0.3 

216.0 

66.8 

0.88 

Pal 15 

38 

147.8 

1.1 

- 

- 

0.99 

Pal 5 

16 

-11.0 

16.0 

62.0 

38.0 

0.95 

NGG 7006 

39 

-185.2 

0.4 

- 

- 

0.98 

NGG 5897 

7 

49.0 

1.0 

138.0 

59.4 

0.79 

Pyxis 

41 

-195.2 

1.9 

- 

- 

0.98 

NGG 5904 

6 

-313.0 

0.5 

234.0 

39.6 

0.33 

Pal 14 

72 

165.4 

0.2 

- 

- 

1.00 

NGG 6093 

3 

60.0 

4.1 

85.0 

28.2 

0.67 

NGG 2419 

90 

-26.4 

0.5 

- 

- 

1.00 

NGG 6121 

6 

-58.0 

0.4 

25.0 

22.6 

-0.91 

Eridanus 

95 

-141.0 

2.1 

- 

- 

1.00 

NGG 6144 

3 

109.0 

1.1 

137.0 

33.3 

0.47 

Pal 4 

111 

50.5 

2.1 

- 

- 

1.00 

NGG 6171 

4 

20.0 

0.3 

156.0 

36.9 

-0.29 

AM 1 

125 

-41.6 

20.0 

- 

- 

1.00 

NGG 6205 

8 

279.0 

0.9 

129.0 

35.0 

0.48 

Fornax 

140 

-31.8 

1.7 

196.0 

29.0 

1.00 

NGG 6218 

5 

-21.0 

0.6 

168.0 

22.0 

-0.46 

Leol 

261 

167.9 

2.8 

101.0 

34.4 

1.00 

NGG 6254 

5 

-53.0 

1.1 

178.0 

28.3 

-0.59 

LMC 

49 

93.2 

3.7 

346.0 

8.5 

0.99 

NGG 6341 

9 

70.0 

1.7 

46.0 

26.2 

0.61 

SMG 

60 

6.8 

2.4 

259.0 

17.0 

0.99 

NGG 6362 

5 

-40.0 

0.6 

134.0 

20.5 

0.25 

Sculptor 

87 

79.0 

6.0 

198.0 

50.0 

1.00 

NGG 6397 

6 

18.0 

0.1 

166.0 

16.3 

-0.83 

Draco 

92 

-98.5 

2.6 

210.0 

25.0 

1.00 

NGG 6584 

6 

150.0 

15.0 

185.0 

55.9 

0.88 

BootesI 

57 

106.6 

1.0 

- 

- 

0.99 

NGG 6626 

3 

8.0 

1.0 

172.0 

26.4 

-0.87 

BootesII 

43 

-115.6 

5.0 

- 

- 

0.98 

NGG 6656 

5 

172.0 

0.6 

214.0 

31.9 

-0.94 

Canes Venaticil 

219 

76.8 

1.0 

- 

- 

1.00 

NGG 6712 

4 

208.0 

0.6 

132.0 

21.5 

-0.08 

Canes Venaticill 

150 

-96.1 

1.0 

- 

- 

1.00 

NGG 6752 

5 

-19.0 

1.5 

200.0 

11.4 

-0.50 

Carina 

102 

14.3 

1.0 

- 

- 

1.00 

NGG 6779 

9 

172.0 

0.9 

39.0 

58.1 

0.63 

ComaBernices 

45 

82.6 

5.0 

- 

- 

0.98 

NGG 6809 

4 

-181.0 

0.4 

119.0 

30.4 

-0.49 

Hercules 

141 

142.9 

1.0 

- 

- 

1.00 

NGG 6838 

7 

3.0 

0.2 

180.0 

17.8 

-0.05 

LeoII 

235 

26.5 

8.0 

- 

- 

1.00 

NGG 6934 

12 

-305.0 

1.6 

124.0 

93.0 

0.86 

LeoIV 

154 

13.9 

1.0 

- 

- 

1.00 

NGG 7078 

10 

-74.0 

0.6 

141.0 

34.7 

0.70 

LeoV 

175 

62.3 

3.0 

- 

- 

1.00 

NGG 7089 

10 

46.0 

0.9 

331.0 

63.9 

0.74 

Sagittarius 

16 

166.3 

60.0 

- 

- 

0.93 

NGG 7099 

7 

14.0 

1.0 

120.0 

30.8 

0.46 

Seguel 

28 

113.5 

1.0 

- 

- 

0.97 

NGG 5634 

21 

-0.8 

6.6 

- 

- 

0.95 

Segue2 

41 

39.7 

1.0 

- 

- 

0.99 

NGG 6284 

8 

0.3 

1.7 

- 

- 

0.98 

Sextans 

89 

78.2 

1.0 

- 

- 

1.00 

NGG 6356 

7 

0.6 

4.3 

- 

- 

0.97 

UrsaMajorl 

101 

-8.8 

1.0 

- 

- 

1.00 

NGG 6426 

14 

-0.5 

23.0 

- 

- 

0.96 

UrsaMajorll 

36 

-36.5 

2.0 

- 

- 

0.99 

NGG 6441 

4 

-0.0 

1.0 

- 

- 

0.95 

UrsaMinor 

77 

-89.8 

8.0 

- 

- 

1.00 

NGG 6453 

4 

-0.9 

8.3 

- 

- 

0.98 

Willmanl 

42 

33.7 

2.0 

- 

- 

0.98 


Note. — Columns from left to right: objects’ names, Galactocentric distance, radial velocity, uncertainty in radial velocity, tangential 
velocity, uncertainty in tangential velocity,and COS7. All data are in Galactocentric coordinates (r, Ur, vt) as described in Section [2.4[ with 
the exception of GCs and DGs that lack tangential velocities (see text). Conversions from line-of-sight and proper motion measurements 
to Galactocentric measurements were completed by the studies mentioned in Section]^ 


tain to typically 5% (see Harris 1996). The uncertain¬ 
ties associated with r and Vr, and the differences in the 
LSR assumed motion used among the different studies 
are ^ 15 km/s, and thus unimportant compared to the 
uncertainties associated with the Vt values. 

In the following analysis, we specifically assume (a) a 
spherical Hernquist-like or Jaffe-like halo potential, (b) 
equal weights for all data points, (c) no net rotation of 
the halo, and (d) that all tracers are bound to the Galaxy. 


5. RESULTS 

5.1. Simulation Results 


Figurel^shows the distribution of the mean parameter 
estimates from scenario[ll Black dots are the mean of the 
estimates, and red dashed lines are the true parameter 
values. On average, the estimates are unbiased within 
one standard deviation (sd), and the sd of the chains is 
roughly equal to the sd of the estimates. 

Because the Markov chain represents the posterior 
distribution, we can also calculate credible regions — 
Bayesian analogues of confidence intervals — for the 
M{r) profile. An example of the mass profile credible re¬ 
gions for one data set is shown in Figure!^ where shades 
of teal show the 50, 75, and 95% credible intervals as a 
function of r. Credible regions are found by calculating 
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CO 


o 

CO 
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o 

C\J 


o 


0.7 0.8 0.9 1.0 1.1 1.2 1.3 10 12 14 16 18 20 22 

mean of Mtot (10^^ Mgun) mean of a (kpc) 

Fig. 1.— Empirical distribution of Mtot estimates from simulated data analysis. Black points and red dashed lines show the mean 

of the estimates and the true value of the parameter respectively. The standard deviation of the estimates is 0.07 and 1.77 for Mtot and a 

DF of the assumed model and the DF of the data are the 
same. For example, the true M(r) curve fell within the 
75% credible region seventy-five percent of the time over 
the course of the 500 analyses for scenario H 
In scenario we randomly remove 50 uys from each 
data set, andtreat them as parameters in the analy¬ 
sis. We find a very small positive bias in both the 
Mtot and a estimates, as shown in the top two panels in 
Fig.i The bias is insignificant, as the means of the esti¬ 
mates (1.01 X IO^^Mq and 15.2kpc) are within one stan¬ 
dard deviation of the distribution (0.08 x IO^^Mq and 
l.Qkpc respectively). The slight although insignificant 
positive bias suggests that the median may be a better 
estimate of the mass than the mean, but the median is 
almost identical to the mean in all cases. 

In scenario [^recall that an isotropic model is assumed, 
but the data sets in scenario have constant anisotropic 
velocity dispersions /3 = 0.5. Despite the data and model 
having different DFs, the estimates show only a slight 
positive bias; the true Mtot and a are still within one 
standard deviation of the distribution (see Fig. 1^. The 
mean of the estimates for Mtot and a are 1.01 f^0.08 x 
lO^^MQand 15.3 ± l.Skpc respectively. 

Examples of mass prohle credible regions from scenar¬ 
ios!^ and presented in Fig. Note that the intro¬ 

duction ofvt parameters tends to increase the width of 
the credible regions at all r values compared to Fig. 

In scenario!^ we find the credible regions to be slightly 
over confident for values of 17 < r < 35kpc, with the 
true M{r) falling within the 50, 75, and 95% regions 48, 
73, and 93% of the time over the 500 analyses. At all 
other r values, however, the credible regions are reliable. 
In scenario the credible regions are slightly lower than 
the true cumulative mass profile; the opposite is the case 


respectively. 



r (kpc) 

Fig. 2.— Example of predicted and true mass profile from anal¬ 
ysis of simulated data. The true M{r) profile is shown in red, and 
the 50, 75, and 96% credible regions are shown as shades of teal. 
Note: this is the result from one analysis (i.e. one data set). 


M{r) at several different r values, for every set of param¬ 
eters in the Markov chain. The true M(r) profile is the 
solid red line, calculated from eq.j^and the true Mtot and 
a. We find that the credible regions are reliable when the 
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mean of M,ot (10^^ ) mean of a (kpc) 


Fig. 3.— Empirical distribution of Mtot and a estimates from simulated data analysis, with 50 tangential velocities removed. The top 
panels are for scenario § and the bottom, §. The black points and red dashed lines show the mean of the estimates and the true value 
of the parameters respectively. The standard deviations of the estimates in scenario (|^ are 0.08 x 10^^Mq and l.Okpc, while the standard 
deviations are 0.08 x 1O^^M0 and 1.8kpc in |3|. 
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Fig. 4.— Example cumulative mass profile when 50 ut’s are unknown, and an isotropic Hernquist model is assumed. The true profile is 
shown as a solid red line, and the credible regions are shown as shades of teal. The left profile is scenario § and the right is scenario §• 
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for scenario!^ but for both the true curve still lies in the 
75% credible region for most r (see Fig. We reiter¬ 
ate, however, that Fig. are examples oiAI (r) profiles 
from a single data set and analysis. Over 500 analyses 
we find that the 50, 75, and 95% credible regions do con¬ 
tain the true M(r) 50, 75, and 95% of the time, in both 
scenarios and 1^ for almost all r. 

5.2. Milky Way Results 

Assuming an isotropic Hernquist model, and using all 
the kinematic data from Table we find a mean Mtot of 
1.55±0.08x 10^^ Mq and a scale radius of 16.9±2.3 kpc, 
where the uncertainties are the standard deviations of 
the parameters in the Markov chain. The 95% credible 
regions for Mtot and a are (1.42,1.73) x 10^^ Mq and 
(12.8,21.7) kpc respectively. We also report the mean 
Mtot and scale radius, with uncertainties of one stan¬ 
dard deviation, when the other models are assumed (Ta¬ 
ble . The mass estimates and scale radii vary only 
slightly between Hernquist models. The Jaffe model’s 
mass is similar, but the Jaffe scale radius radius cannot 
be compared directly to that of Hernquist because they 
have physically different definitions. 

The mass profile credible regions are shown in Fig. 
The innermost dark region corresponds to the 50% credl 
ible region. The vertical dashed lines show the extent 
of the data, with NGC 6540 and Leo 1 being the clos¬ 
est and furthest objects from the Galactic center respec¬ 
tively. The mass contained within the distance of Leo 1 is 
l-37lo;^Q ^ IO^^Mq, where the uncertainties correspond 
to the 95% credible interval. 



Fig. 5.— Mass profile credible regions assuming a Hernquist 
model with isotropic velocity dispersion. The dashed lines indi¬ 
cate the location of NGC 6540 and Leo I (the closest and furthest 
objects from the Galactic center respectively in our dataset). 


Some satellites may have a large effect on the mass 
estimate of the Galaxy. Leo I, for example, remained 
a contentious object for many years, because it is at a 
large distance from the Galactic center and it was un¬ 
clear wh ether or not it is bound to th e MW. Recently, 
however, ||Boylan-Kolchin et al. (2013) showed that Leo 
I is likely bound to the MW. i'urthermore, when Leo 
I’s proper motion is taken into account, the ob ject has 


little effect on the m ass estimate of the Galaxy (Wilkin¬ 


son & Evans 1999). We run our analysis assuniing the 


isotropic Hernquist model both with and without Leo 
I, and we also find that it has no effect within error 
on Mtot- When Leo I is removed from the analysis, 
Mtot = 1-52 ± 0.07 X IO^^Mq and a = 16.2 ± 2.2 kpc, 
very similar to the values obtained when Leo I is present. 

The five other dwarf galaxies in our data set that have 
measured proper motions are Draco, Fornax, Sculptor, 
and the Large and Small Magellanic Clouds (hereafter 
LMC and SMC). We obtain parameter estimates as¬ 
suming an isotropic Hernquist model with each of these 
dwarfs removed, and find that the mass and scale radii 
do not change within error in any case. 

Another object to consider is Sagittarius. In Section]^ 
we argued that vios ~ Vr for tracers with jcosyl > 0.95, 
but Sagittarius is relatively close-by at 16 kpc and has 
jcosql = 0.93, so one may question the inclusion of this 
object. However, once again we find little change in 
the mass estimate without it, for all models. For ex¬ 
ample, the isotropic Hernquist model returned Mtot = 
1.55 ± 0.08 X IO^^Mq, which is almost identical to the 
result obtained using all the data (see Tablej^. The scale 
radius is also unchanged within error (17.0^ 2.2kpc). 

We also investigate the effects on the mass estimate 
when tangential velocities are treated as parameters. 
To do this, we first obtain a mass estimate using only 
the Dinescu data (i.e. using only objects with com¬ 
plete velocity vectors), and find a slightly lower mass of 
1.47 ± 0.08 X 10^^ -^^0- Next, we remove five tangential 
velocities from the data, and repeat the analysis treating 
those missing vtS as parameters. Repeating this process 
and removing five different VtS each time, we find that 
the VtS cannot be well estimated. However, treating vtS 
as parameters has little to no effect on the mass estimate, 
within error. There is one exception to the latter state¬ 
ment: when Pal 3’s tangential velocity was removed, the 
mass estimate was reduced significantly. 

To investigate the influence of Pal 3 further, we per¬ 
formed an analysis using only the Dinescu data, but with¬ 
out Pal 3’s Vt value. Treating Pal 3’s Vt as a parameter, 
the mass estimate of the Milky Way fell by more than 
50% {Mtot = 0.76 ± 0.06 x lO^^M©). We also ran the 
analysis using only the Dinescu data, but without any 
Ut’s. In this case, Mtot = 0.8 ± 0.1 x IO^^Mq, which is 
similar to the estimate obtained in the former analysis. 
We note, however, that Pal 3 has the most uncertain vt 
measurement in the list (Table [^. It is evident that in¬ 
cluding measurement uncertainties in the analysis would 
reduce its leverage considerably. 

Using all kinematic data, but removing Pal 3 from the 
analysis, also resulted in reduced mass estimates. Fur¬ 
thermore, the effect is observed regardless of the selected 
model (Table [^. Thus, Pal 3’s proper motion, and in¬ 
deed Pal 3 in general, has significant influence on the 
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TABLE 2 

Parameter Estimates for the Milky Way. 



All Data 

Without Pal 3 

Without Draco 

Model - 

Aftot 

Scale Radius 
(kpc) 

Aftot 

(lOi^Mp,) 

Scale Radius 
(kpc) 

Mtot 

(lOi^Mft)) 

Scale Radius 
(kpc) 

H - iso 

H - OM 

H - ^ = 0.5 
Jaffe - iso 

1.55 ± 0.08 
1.52 ± 0.08 
1.47 ± 0.07 
1.61 ± 0.09 

16.9 ± 2.3 

16.7 ± 2.2 

12.1 ± 1.9 

47.7 ± 8.5 

1.36 ± 0.07 
1.34 ± 0.06 
1.31 ± 0.06 
1.38 ± 0.06 

16.7 ± 2.2 

16.4 ± 2.1 

12.3 ± 1.9 
48.9 ± 8.7 

1.55 ± 0.08 
1.52 ± 0.08 
1.46 ± 0.07 
1.57 ± 0.08 

16.8 ± 2.3 

16.6 ± 2.2 

11.9 ± 1.8 
45.2 ± 8.2 


Note. — In the first column, the first three models are of the Hernquist type, with isotropic, OM-type anisotropy, and constant 
anisotropy. The last row shows the results of assuming an isotropic Jaffe model. Uncertainties are one standard deviation of the posterior 
distribution. 



Fig. 6. — Satellite energies given the model parameters from the 
isotropic Hernquist model fit. Satellites without tangential veloci¬ 
ties are shown as open circles, and are plotted using the estimated 
vt value from the model fit. Unbound (escaping) objects would lie 
above the dotted line. 


mass estimate of the Gal axy. This issue r e gardin g Pal 
3 confirms the finding of Sakamoto et al.| (2003), who 
noted that high-velocity objects such as Pal 3 and Draco 
can have a significant effect on the mass estimate of the 
Galaxy. As mentioned previously, we also test the effect 
of Draco on the mass estimate and find that it has little 
effect on the mass estimate (Table [^. 

To demonstrate the effectiveness of using vt's as pa¬ 
rameters, consider Figure Using eq. and the mean 
parameter values from the isotropic Hernquist model fit, 
we plot the negative of the tracers’ specific energies as a 
function of r. Filled points are satellite data with com¬ 
plete velocity vectors, and hollow points are data with 
unknown ut’s (plotted using the mean vt estimates from 
the Markov chain). As demonstrated by our simulations 
and tests with the Dinescu data, the Vt values cannot be 
well estimated. However, the Vt parameters appear to 
converge to an average Vt for a given r value, and this is 
reflected in the positions of the hollow points in Fig. 


The results demonstrate that in a small sample of data, 
some objects carry greater influence on the mass estimate 
than others. Furthermore, the variation in these results 
implies that it would be fruitful to weight the data by 
their measurement uncertainties. In a Bayesian analy¬ 
sis, however, a fully hierarchical approach is necessary to 
properly include the measurement uncertainties of the 
data, and a probability distribution for the errors must 
be assumed. We leave this analysis to a future paper, 
and instead perform three more approximate sensitivity 
analyses. 

The first two sensitivity analyses are extreme cases: 1) 
all the tangential velocities are increased by 2Avt, and 2) 
all the tangential velocities are decreased by 2Avt ■ In the 
third and more realistic sensitivity analysis, we randomly 
change each Vt into a new tangential velocity Vt^new via, 

Vt,new = Vt + N{0,Avt) (25) 

where N(0, Avt) represents a random number drawn 
from a normal dist ribution with mean zero and variance 
Avt- Using eq. we generate 100 data sets with new 
Vt values, and then analyze these data sets assuming the 
isotropic Hernquist model. The estimates of Mtot and a 
from the 100 analyses have a distribution that confirms 
the results of the original analysis (Fig. [^; the mean of 
the estimates is nearly identical to the result in Table 

The results of the sensitivity analyses show that a 
proper treatment of the measurement uncertainties is 
worth pursuing. In future analyses, we plan to incor¬ 
porate the measurement uncertainties of the data via a 
hierarchical model. 

6 . DISGUSSION AND FUTURE PROSPECTS 

The results of this study are promising. Not only does 
the Bayesian analysis provide an effective way of incorpo¬ 
rating complete and incomplete data, but it also enables 
easy calculation of probabilistic credible regions for the 
cumulative mass profile. Eurthermore, even though this 
is a preliminary analysis, and mostly meant to lay the 
groundwork for future studies, our total mass estimates 
are similar to other studies that use different methods. 

Because our method returns a sample of parameter 
values representing the posterior distribution, it is easy 
to compare our results with mass estimates, at any radii, 
obtained in other studies. We can compute M(r) credible 
regions from our Markov chain at any r value, and obtain 
a mass estimate at that radius, with uncertainties. 

A collection of total mass estimates within r = 
lOOkpc, from 10 different studies, has been compiled 
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mean of Mtot (10^^ Msun) 



16.4 16.6 16.8 17.0 17.2 17.4 

mean of a (kpc) 


Fig. 7.— Distribution of Mtot and a estimates from the third sensitivity analysis. The black dots show the mean of the estimates. The 
value of the mean and the standard deviation of the empirical distribution are shown in the legend. 


by Courteau et al. ( |2014[ ). 
this radius assuming 


Our mass estimate at 
he isotropic Hernquist model is 


Mioo = 1.14 X lO^^MQwith a 95% credible region of 
(1.05, 1.26 )XIO^^Mq, which is within the range of val¬ 
ues listed in the review. 


Watkins et al. (2010) find the mass within 300 kpc 


to be 0.9 ± 0.3 x for an isotropic model. 

Our estimate M 300 assuming an isotropic model is 
1.39 X lO^^MQwith a 95% credible region of (1.29, 
1.53 )XIO^^Mq. When they consider an anisotropic ve¬ 
locity distribution with /? derived from the observational 
data, however, they find 3.4 ± 0.9 x IO^^Mq, in contrast 
to our /? = 0.5 constant anisotropic model that gives 
-^300 = 1-35 X IO^^Mq, with a 95% credible interval of 
( 1.27,1.51) x10^^M(T) 

Deason et al. ( 2013| ) used BHB stars to trace the MW’s 


mass, and found M{r = 50kpc) to be approximately 
4 X IO^^Mq, assuming a model of constant anisotropy 
with P = 0.5. However, our mass estimate for the MW at 
50kpc, using the Hernquist constant anisotropic model, 
is substantially higher at 9.5 x IO^^Mq with a 95% cred¬ 
ible interval of (8.5,11.0) x IO^^Mq. Even removing Pal 
3 from the data set does not lower this estimate signifi¬ 
cantly, reducing it to 8.5 x lO^^MQwith a 95% credible 
interval of (7.5, 9.6) x IO^^Mq. 

Using a tru ncated, flat rotation curve model, [Battaglia| 


et al. (2005) found the mass of the MW dark matter 


halo to be 1.2^J'® x IO^^Mq , and with an NEW model 


found a virial radius of 0 . 8^()'2 x IO^^Mq. Boylan- 


Kolchin et al. (2013) estimate the MW virial mass at 


1.6 X IO^^Mq, with a 90% confidence interval of 1.0 
to 2.4 X IO^^Mq. Sohn et al. (2013) use the timing 
argument of Leo 1 to arrive at a virial mass es timate 
= 3.15l(j^ X IO^^Mq. |Li & White] (|2008|) found 
the virial mass to be 2.4 x with a lower 95% 

confidence level of 0.8 x IO^^Mq. Thus, our preliminary 
results are on par with many other studies that use dif¬ 
ferent methodologies. 


To o ur kn owledge, no other studies besides Sakamoto 


et al. (2003) have found Pal 3 to carry so much weight 


in the analysis. Pal 3’s proper motion is already known 
(though with large uncertainties) and the satellite does 
not lie as far from the Galactic center as Leo I and other 
satellite dwarfs, which may have allowed its effect to go 
unnoticed. Removing Pal 3’s true Vt from the analysis 
and treating it as a parameter lowered the mass signif¬ 
icantly, suggesting that the tangential velocity is in the 
tail of the Vt distribution at r = 84kpc. The dwarf galax¬ 
ies in our data set, on the other hand, seem to have little 
individual influence on the mass estimate of the Galaxy 
even though some have high velocities and large distances 
from the galactic center. 

Many improvements, challenges, and exploratory anal¬ 
yses remain: 


1. One way to substantially improve the analysis is 
to incorporate measurement uncertainties via a hi¬ 
erarchical model, rather than the preliminary sen¬ 
sitivity analysis performed here. In the Bayesian 
paradigm, a probability density function of the 
measurement errors must be assumed. For exam¬ 
ple, for each data point the known measurement 













































































13 


uncertainty may be used to define the variance of 
a Gaussian distribution centered on the measure¬ 
ment value. Objects with high influence when mea¬ 
surement errors are ignored (e.g. Pal 3) might have 
reduced influence when measurement errors are in¬ 
cluded. 


2. An immediate challenge is finding models with dis¬ 
tribution functions that are tractable and that de¬ 
scribe the Milky Way in a more sophisticated man¬ 
ner. Using the NFW model is of particular interest 
because it is known to ht the dark matter halos of 
galaxies on many different scales, as well as groups 
and clusters of galaxies. The NFW DF, however, 
is not analytic. Although numerical solutions for 
the NFW DF have been derived, applying them in 
the Bayesian framework is more difficult because of 
the model’s infinite mass. When a model’s DF does 
not integrate to a finite mass (eq. then it is not 
a proper probability distribution— a requirement 
when applying Bayes’ theorem. We are currently 
exploring the problem through an Approximate 
Bayesian Computation (ABC) algorithm, which al¬ 
lows for calculations of posterior distributions with¬ 
out explicit calculation of the likelihood. 


3. The DFs for all of the models employed here 
are analytic. Furthermore, the models are self- 
consistent— i.e. we implicitly assume that the dark 
matter and the satellites follow the same distribu¬ 
tion. However, it is possible that the tracers (GCs 
and dwarf galaxies) have a different distribution 
than the dark matter halo particles. For example, 
the tracers may have a Hernquist-type density pro¬ 
file ph(?') given by eq.^ but may reside in an NFW 
gravitational potentiaTgiven by 

^»pw(.) = -4w; ‘° » + (26) 

r/Ts 


In this situation, there are two extra parameters, 
Tg and po, which correspond to the scale length and 
density parameter of the dark matter halo. We can 
derive the DF for such a model via the Eddington 
formula: 


/(£) = 


1 


1 


■y/STT^ 


(fP. 


0 




Vs 


(27) 


where = $ — $0 is th e relative potential (see|Bin- 
ney fc Tremaine|2008 1 . For the Hernquist model, p 
can be written as an analytic function of dt, and the 
integral can be evaluated in closed form. For the 
case at hand, however, the relation between pn and 
'I'nfw is a transcendental equation. Nevertheless, 
the integral required in the Eddington formula can 
be evaluated numerically. Widrow et al. (2008), 
for example, used this method to derive DFs tor 
their self-consistent disk-bulge-halo galaxy models. 
A numerically derived DF may still be used with 
our method, as long as it is a normalized probabil¬ 
ity distribution. We plan to implement models of 
this type in future studies of tracer populations. 


4. Different velocity anisotropy formalisms are also of 
interest. For example, other Hernquist model DFs 
of diffe rent anisotropies are diyussed by |Cuddeford| 
(1991) and Baes & Dejonghe (2002). The former 
derived a velocity anisotropy that is a generaliza¬ 
tion of the OM-type anisotropy, where another pa¬ 
rameter Bn is introduced such that 


B{r) = 


■Pori 


(28) 


When Bo 
anisotrop’ 

= 0, eq. 
f. As r„, 00 ^ 

reduces to OM-type 
(r) — >■ Bo, in constrast 

to eq. |7| 

Baes & Dejonghe 

(20021 derive a Hern- 

quist DI' 

using this formalism, and show that only 


four values of Bo lead to DFs that can be expressed 
in terms of elementary functions. The simplest of 
these DFs occurs when Bo — 0.5, while the other 
DFs ar e ” ...somewhat more ela borate” and not pro¬ 
vided ( Baes fc Dejonghe]|2002 ). 


5. We will explore biases that may occur due to selec¬ 
tion effects. In the Hernquist simulations used here, 
some tracers are unrealistically far from the Galac¬ 
tic center (e.g. more than 500kpc away), while our 
kinematic data set has a range from r = 3kpc to 
261 kpc. Imposing a more realistic range on sim¬ 
ulated data may or may not introdnce biases in 
parameter estimates. 


6 . Further along the line, it will be exciting to apply 
the method presented here to large datas ets of field 
halo stars, le ading up to the GAIA data. [Sakamoto] 


et al. (2003) showed that including many field hor¬ 
izontal branch stars greatly reduced the effect of 
high-velocity objects (such as Draco and Pal 3) on 
the mass estimate of the Milky Way. Therefore, it 
can be expected that the accurate and abundant 
kinematic data from GAIA will also improve our 
mass estimates in a major way and decrease the 
effect of outliers. 


7. The method outlined in this paper could also be 
extended to obtain mass estimates of other galaxies 
for which tracer objects will have only the projected 
positions and line of sight velocities. 

SUMMARY 

We have introduced a method to estimate the mass of 
the Milky Way that incorporates both complete and in¬ 
complete data for positions and velocities of tracers. The 
method treats unknown tangential velocities as parame¬ 
ters in the model. Simulations showed that although the 
tangential velocities cannot be well constrained, treating 
VtS as parameters has little effect on the mass estimate 
when other complete velocity vectors are available. An 
exception does occur, however, when a tracer has an un¬ 
usually extreme tangential velocity (e.g. Pal 3). 

Under simple assumptions of a Hernquist-like halo po¬ 
tential and modest anisotropy, we find Mt^t ~ 1.3 —1.5 x 
IO^^Mq, in good agreement with other recent work. For 
an isotropic model we find Mtot = 1-55 x IO^^Mq with 
a 95% credible interval of (1.42,1.73) x IO^^Mq, and a 
scale radius of a = 16.9kpc. We also report the mass 
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contained within 260kpc: 1.37 x IO^^Mq, with a 95% 
credible interval of (1.27,1.51) x IO^^Mq. 

In future research, we will be incorporating measure¬ 
ment uncertainties into the analysis and will test more 
extensively for parameter biases. We also plan to use the 
NEW model and find other DE’s to use in the GME code. 
The method outlined here will eventually be applied to 
extragalactic studies, where the complete velocity vectors 
of the tracers are unknown. 
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